{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "ced4739b45e6238e46d0e5849ccf8d7f",
     "grade": false,
     "grade_id": "cell-b81bc4fe3428522c",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "# Exercise 2: Markov Chains and Markov Decision Processes (MDP) \n",
    "\n",
    "This exercise deals with the formal handling of Markov chains and Markov decision processes. \n",
    "\n",
    "## 1) Markov Chain: State Transition\n",
    "The graph shows the last beer problem. \n",
    "The nodes show the states.\n",
    "The arrows define the possible transitions to other states and the numbers besides the arrows define the propability of the corresponding transition.\n",
    "If you are for example in the state \"Inital Beer\", with 30% propability you go to have a pizza, with 60% propability you meet friends and with 10% propability you end up sleeping.\n",
    "\n",
    "Define the state transition probability matrix $\\mathcal{P}_{xx'}$ of the graph shown in the figure below!\n",
    "\n",
    "![](Last_Beer_Graph.png)\n",
    "\n",
    "With $p_k = \\begin{bmatrix}\n",
    "\\text{Pr}_k \\lbrace \\text{Inital Beer} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Meet Friends} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Pizza} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Another Beer} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{\"Last Beer\"}\\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Sleep} \\rbrace \\\\\n",
    "\\end{bmatrix}^\\text{T}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "579529e1cd388d930772c89d6ab222c4",
     "grade": true,
     "grade_id": "cell-39a3b8abaeef6740",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "de9d879c95ad59853836d5cd34936fba",
     "grade": false,
     "grade_id": "cell-7558d8217d56a6ce",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 2 ) Markov Chain: Stationary State\n",
    "Using $p = p \\mathcal{P}$, calculate the stationary state probability.\n",
    "\n",
    "Please note that the sum of the state propabilities equals one for any specific point in time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "6376a3defa8c2066061932dd778121de",
     "grade": true,
     "grade_id": "cell-70c9b80b439d6bf9",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "8b2f2e53e9bbf3792bc98d29189bca30",
     "grade": false,
     "grade_id": "cell-79b786f9aa689326",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 3) Markov Reward Process: Evaluating States\n",
    "\n",
    "In the following rewards for every state are defined.\n",
    "\n",
    "Given the reward distribution $r_\\mathcal{X}$, calculate the state-values $v_\\mathcal{X}$.  \n",
    "\n",
    "The states are defined by:\n",
    "$\\mathcal{X} = \\left\\lbrace \\begin{matrix}\n",
    "\\text{Inital Beer}\\\\\n",
    "\\text{Meet Friends}\\\\\n",
    "\\text{Pizza}\\\\\n",
    "\\text{Another Beer}\\\\\n",
    "\\text{\"Last Beer\"}\\\\\n",
    "\\text{Sleep}\\\\\n",
    "\\end{matrix}\n",
    "\\right\\rbrace$\n",
    "\n",
    "The rewards are defined by:\n",
    "$r_\\mathcal{X} = \\begin{bmatrix}\n",
    "+1\\\\\n",
    "+1\\\\\n",
    "+2\\\\\n",
    "+1\\\\\n",
    "-3\\\\\n",
    "0\\\\\n",
    "\\end{bmatrix}$\n",
    "\n",
    "The state-value is defined by the state-value Bellman equation: $v_\\mathcal{X} = r_\\mathcal{X} + \\gamma \\mathcal{P}_{xx'} v_\\mathcal{X}$. Assume that $\\gamma = 0.9$ and write a Python program to calculate $v_\\mathcal{X}$. Which state is most promising? Why?\n",
    "\n",
    "Which state is most promising when $\\gamma = 0.1$?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "5213b522b4e1146536ddae3105b0b1d5",
     "grade": true,
     "grade_id": "cell-670e272fcc752ccd",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "8c0c7f945e8bb714ad34de2baa9f3bca",
     "grade": false,
     "grade_id": "cell-065a09c73cd9262d",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# define given parameters\n",
    "gamma = 0.1 # discount factor\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()\n",
    "\n",
    "print(v_X)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "076a12f5ad402257a79d5d27dd363102",
     "grade": false,
     "grade_id": "cell-c36ce0660acbd6e3",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 4) Markov Decision Process: State Transition\n",
    "\n",
    "The graph shows an MDP.\n",
    "The nodes are the states. \n",
    "In every state you can choose between two actions (Lazy or Productive). \n",
    "Taken actions impact the state transition probability to the next state.\n",
    "If you for example have a \"Hangover\" and decide to be \"Productive\", there is a 30% chance for you to \"Visit Lecture\" and a 70% chance to stay in the \"Hangover\" state.\n",
    "\n",
    "Define the lazy state transition probabilitiy $\\mathcal{P}_{xx'}^{u=\\text{Lazy}}$ and the productive state transition probability $\\mathcal{P}_{xx'}^{u=\\text{Productive}}$ of the graph shown in the figure below.\n",
    "\n",
    "![](Hangover_MDP_Graph.png)\n",
    "\n",
    "With $p_k = \\begin{bmatrix}\n",
    "\\text{Pr}_k \\lbrace \\text{Hangover} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Sleep} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{More Sleep} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Visit Lecture} \\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Study}\\rbrace \\\\\n",
    "\\text{Pr}_k \\lbrace \\text{Pass the Exam} \\rbrace \\\\\n",
    "\\end{bmatrix}^\\text{T}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "f7001afecfbaaf428f614582d881f291",
     "grade": false,
     "grade_id": "cell-0476b4adb0cb2c85",
     "locked": true,
     "points": 0,
     "schema_version": 3,
     "solution": false,
     "task": true
    }
   },
   "source": [
    "## 4) Solution\n",
    "\n",
    "\n",
    "\\begin{align}\n",
    "\\mathcal{P}_{xx'}^{u=\\text{Lazy}}&=\\begin{bmatrix}\n",
    "0 & 1 & 0 & 0 & 0   & 0\\\\\n",
    "0 & 0 & 1 & 0 & 0   & 0\\\\\n",
    "0 & 0 & 1 & 0 & 0   & 0\\\\\n",
    "0 & 0 & 0 & 0 & 0.8 & 0.2\\\\ \n",
    "0 & 0 & 1 & 0 & 0   & 0\\\\\n",
    "0 & 0 & 0 & 0 & 0   & 1\\\\\n",
    "\\end{bmatrix}\\\\\n",
    "\\mathcal{P}_{xx'}^{u=\\text{Productive}}&=\\begin{bmatrix}\n",
    "0.7 & 0 & 0   & 0.3 & 0   & 0\\\\\n",
    "0   & 0 & 0.4 & 0.6 & 0   & 0\\\\\n",
    "0   & 0 & 0.5 & 0   & 0.5 & 0\\\\\n",
    "0   & 0 & 0   & 0   & 1   & 0\\\\ \n",
    "0   & 0 & 0   & 0   & 0.1 & 0.9\\\\\n",
    "0   & 0 & 0   & 0   & 0   & 1\\\\\n",
    "\\end{bmatrix}\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "6ed0b81e77e35f548c7881c032af2079",
     "grade": false,
     "grade_id": "cell-c972e01f5bf1aa95",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 5) Markov Decision Process: Trivial Policy Evaluation\n",
    "\n",
    "The rewards for this problem are defined by:\n",
    "$r_\\mathcal{X} = r_\\mathcal{X}^{u=\\text{Productive}} = r_\\mathcal{X}^{u=\\text{Lazy}} = \\begin{bmatrix}\n",
    "-1\\\\\n",
    "-1\\\\\n",
    "-1\\\\\n",
    "-1\\\\\n",
    "-1\\\\\n",
    "0\\\\\n",
    "\\end{bmatrix}$.\n",
    "\n",
    "How can we interprete these rewards?\n",
    "Evaluate both the lazy policy and the productive policy using $\\gamma = 0.9$.\n",
    "\n",
    "Bonus question: Can we evaluate the state-value of $\\lbrace x=\\text{More Sleep}, u=\\text{Lazy}\\rbrace$ for an infinite time horizon without the use of the Bellman equation?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "855973c4562306d45af3f2d745093492",
     "grade": true,
     "grade_id": "cell-ec5a4ac90a58ac12",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "be56ed5d7369dc269e1106a0f65a13e3",
     "grade": false,
     "grade_id": "cell-7da3b2398127ac8e",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "c63d308a4d7308fd29e67ea8d1ae8654",
     "grade": false,
     "grade_id": "cell-f336cf334f83fe6d",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# Bonus question: Can we evaluate the state-value of {𝑥=More Sleep,𝑢=Lazy} for an infinite time horizon without the use of the Bellman equation?\n",
    "\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "49c226a3a6d95cc64f4a4f7b653caccd",
     "grade": false,
     "grade_id": "cell-0058e2bd53d74195",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 6) Action-Value Function Evalution\n",
    "\n",
    "Now, the policy is defined by:\n",
    "\\begin{align}\n",
    "\\pi(u_k=\\text{Productive} | x_k)&=\\alpha,\\\\\n",
    "\\pi(u_k=\\text{Lazy} | x_k)&=1-\\alpha, \\forall x_k \\in \\mathcal{X}\n",
    "\\end{align}\n",
    "\n",
    "Calculate action-values for the problem as described using the 'fifty-fifty' policy ($\\alpha = 0.5$) according to the Bellman Expectation Equation: $q_\\pi(x_k, u_k) = \\mathcal{R}^u_x + \\gamma \\sum_{x_{k+1} \\in \\mathcal{X}} p^u_{xx'} v_\\pi(x_{k+1})$ $\\forall x_k, u_k \\in \\mathcal{X}, \\mathcal{U}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "cd42e9c5a4e275441053a37b2afe98a5",
     "grade": false,
     "grade_id": "cell-63b67694ef859466",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 6) Solution\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "2e2b3a8c68024c1b52b37dee1f90affb",
     "grade": false,
     "grade_id": "cell-c39018db625f6a27",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "gamma = 0.9\n",
    "alpha = 0.5\n",
    "no_states = 6\n",
    "no_actions = 2\n",
    "r_X = np.array([-1, -1, -1, -1, -1, 0]).reshape(-1, 1)\n",
    "q_XU = np.zeros([no_states, no_actions])\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "25886df132196985ff25d234bc2c6d48",
     "grade": false,
     "grade_id": "cell-234818bf849f0638",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 7) Markov Decision Problem: Stochastic Policy Evalution\n",
    "\n",
    "Plot the state-value of the states \"Lecture\" and \"Study\" for different $\\alpha$. What do we see? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "a2eb57d4066a1bc4ffd54af048e9916d",
     "grade": false,
     "grade_id": "cell-f8b7c0c069fbda39",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 7) Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "4b64f016a5738c629c66799fe67c16a6",
     "grade": false,
     "grade_id": "cell-58dae08eed6fcc5a",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('seaborn-v0_8-talk')\n",
    "\n",
    "n = 6 # dimension of state space\n",
    "no_of_samples = 1000\n",
    "\n",
    "alphas = np.linspace(0, 1, no_of_samples)\n",
    "v_n_alpha = np.zeros([n, no_of_samples])\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "49c919529b7b6c5aa09dde0b11555c23",
     "grade": false,
     "grade_id": "cell-ce3713de655bb104",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=[10, 6])\n",
    "states = [\"Hangover\", \"Sleep\", \"More Sleep\", \"Visit Lecture\", \"Study\", \"Pass Exam\"]\n",
    "alphas = alphas.flatten()\n",
    "for state, vnalp in zip(states, v_n_alpha):\n",
    "    ls = '--' if state in ['Visit Lecture', 'Study'] else '-'\n",
    "    plt.plot(alphas, vnalp, ls=ls, label=r\"$x=${}\".format(state))\n",
    "    \n",
    "plt.legend()\n",
    "plt.xlabel(r\"$\\alpha$\")\n",
    "plt.ylabel(r\"$v_\\pi(x)$\")\n",
    "plt.xlim([0, 1])\n",
    "plt.ylim([-10, 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "45c8dee229e204bd121effc21f77e5ca",
     "grade": true,
     "grade_id": "cell-af98cbff882e9d6f",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "RL25",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
